Density Estimation over Data Streams
نویسندگان
چکیده
A growing number of real-world applications share the property that they have to deal with transient data arriving in massive volumes, so-called data streams. The characteristics of these data streams render their analysis by means of conventional techniques extremely difcult, in the majority of cases even impossible. In fact, to be applicable to data streams, a technique has to meet rigid processing requirements such as a single pass over the stream and strict limitations on the available memory. With respect to these requirements, an adequate real-time mining and analysis of a data stream has turned out to be a highly demanding task, which can only be tackled by developing tailored techniques. In this work, we delve more deeply into an important building block of many data mining and analysis approaches, namely density estimation, with the objective to adapt it to the data stream scenario. Density estimation, a vital research topic in mathematical statistics, aims to capture the unknown distribution of a real-valued data set by estimating its underlying probability density function. A suitable density estimate can be exploited to gain insight into the main features of the data; it can also prepare the ground for a variety of further analysis tasks. Due to the importance of density estimation, various methods have been developed in recent decades. Among the most promising methods are the nonparametric ones. They stand out from the parametric approaches as they only make very weak assumptions on the data; they let the data speak for themselves. Two important members of the class of nonparametric methods are based on kernels and wavelets. Kernel as well as wavelet density estimators are theoretically founded and practically approved, but their computational burden severely impedes their applicability. In particular, neither kernel nor wavelet density estimators can be computed over real-valued data streams because their computational complexity collides with the processing requirements for streams. However, as density estimation can be a crucial component of data stream analysis, we devoted ourselves to the adaptation of kernel and wavelet density estimators to data streams in compliance with the processing requirements for streams. Our solution for wavelet density estimators, termed compressed-cumulative WDEs, relies on processing blocks of data of the stream, where each data block is associated with a separate estimator. While processing the stream, the block estimators are successively merged into one estimator. Our solution for kernel density estimators utilizes speci c building blocks, termed Cluster Kernels, to derive an estimator for the stream on demand. These Cluster Kernels summarize already processed elements in terms of local statistics, which are used for an approximate resampling of elements. Both solutions exhibit a generic design to facilitate the integration of new strategies for their main processing steps. In order to assess them, we conducted an extensive experimental study, including a thorough comparison with competitive techniques. As the results of this study reveal, our techniques outperform all competitors. Overall, they succeed in estimating the unknown density of a data stream.
منابع مشابه
Towards Kernel Density Estimation over Streaming Data
A variety of real-world applications heavily relies on the analysis of transient data streams. Due to the rigid processing requirements of data streams, common analysis techniques as known from data mining are not applicable. A fundamental building block of many data mining and analysis approaches is density estimation. It provides a well-defined estimation of a continuous data distribution, a ...
متن کاملStream Mining via Density Estimators: A Concrete Application
Many real-world applications share the property that the data they process arrives in streams. The transient and volatile nature of these streams renders the application of common processing and analysis techniques difficult. In particular, the mining of streams has proved to be a difficult task due to the rigid processing requirements that must be met within the data stream scenario. We propos...
متن کاملContinuous Adaptive Outlier Detection on Distributed Data Streams
In many applications, stream data are too voluminous to be collected in a central fashion and often transmitted on a distributed network. In this paper, we focus on the outlier detection over distributed data streams in real time, firstly, we formalize the problem of outlier detection using the kernel density estimation technique. Then, we adopt the fading strategy to keep pace with the transie...
متن کاملA PCA-Based Change Detection Framework for Multidimensional Data Streams
Detecting changes in multidimensional data streams is an important and challenging task. In unsupervised change detection, changes are usually detected by comparing the distribution in a current (test) window with a reference window. It is thus essential to design divergence metrics and density estimators for comparing the data distributions, which are mostly done for univariate data. Detecting...
متن کاملMaintaining Nonparametric Estimators over Data Streams
An effective processing and analysis of data streams is of utmost importance for a plethora of emerging applications like network monitoring, traffic management, and financial tickers. In addition to the management of transient and potentially unbounded streams, their analysis with advanced data mining techniques has been identified as a research challenge. A well-established class of mining te...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2007